(G)Linear mixed-effect models

Laurel Brehm

What’s an MEM?

MEM = mixed-effects model: a model that has both fixed and random effects.

Anything you can analyse with a repeated measures ANOVA, you can analyse with MEM. In addition, MEM allows crossed random effects, both categorical AND continuous predictors, and is good for unbalanced data sets. It’s a very versatile tool.

First lmer model

We’ll start out with a simple model that has a continuous dependent measure (reaction time; RT) and a single continuous predictor (raw frequency). In the experimental design, data were collected from many people (subjects) who observed many items (words). Each person saw each word and made a decision about it. In addition, each word is associated with one frequency level.

In model terms: Predict the dependent measure RT from the fixed effect of raw word frequency, plus random intercepts for subjects and items.

First lmer model: fixed effects

Fixed effects are things that are measured/manipulated on purpose: things you predicted could have an effect on the dependent measure. These are observations not drawn at random, so their effects are allowed to vary away from zero. Frequency is a fixed effect because we manipulated it in the experiment. So is Native Language. Start by making a plot of these:

ggplot(lexdec, aes(x=Frequency, y=RT, color=NativeLanguage,
                   shape=NativeLanguage, lty=NativeLanguage)) +
  geom_point() + geom_smooth(method='lm') +
  theme_bw() + scale_color_viridis_d()

First lmer model: random effects

Random effects are things that are drawn at random (hence the name). If you have a ‘repeated measures’ design, the repeated measures will almost always be your random intercepts. In the model, their average effect will always be zero, but they may capture variance. Each person might respond a little differently, and each word might be a little different, even if it has the same level of frequency.

Random slopes are like a slope term in the fixed-effect portion of the model (a predictor), except that we assess whether the slope varies across the groups that define a random intercept.

Make plots of these as well:
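One way to visualize the random-effect structure is to plot the by-subject and by-word RT distributions. A minimal sketch, assuming the lexdec data from the languageR package:

```r
library(languageR)   # provides the lexdec data
library(ggplot2)

# By-subject variability: each person may have their own baseline RT.
ggplot(lexdec, aes(x=Subject, y=RT)) +
  geom_boxplot() + theme_bw()

# By-word variability: words can differ even at the same frequency level.
ggplot(lexdec, aes(x=reorder(Word, RT, median), y=RT)) +
  geom_boxplot() + theme_bw() +
  theme(axis.text.x = element_text(angle=90, size=6))
```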






Building a model

Now put these pieces together for a model. It’s a similar syntax to lm, but we’re adding an ‘er’ to the end, and a piece for the random terms:

model_output <- lmer(DV ~ 1 + FixedEffects + (1 + RandomSlope | GroupingFactor), data=Data)

We will also include an optional argument today, REML = F. This means we will NOT use “restricted maximum likelihood”– we will use maximum likelihood instead. This is useful for model comparisons, which is something we will do today to evaluate the effect of frequency.

The 1 + before the fixed effects is actually also optional. It means ‘give me an intercept’, but the model will give you an intercept by default.

Building a model

Here is the maximal model involving frequency and native language that is justified by our data: include the interaction between them (as a fixed effect), plus a random intercept for Subject and Word, and a random slope for Frequency by Subject, and a random slope of Native Language by Word.

Think: why is this random effect structure the maximal one justified?

contrasts(lexdec$NativeLanguage) <- c(-.5, .5)

mem1 <-  lmer ( RT ~ 1 + Frequency*NativeLanguage  + (1 + Frequency | Subject) +  (1 + NativeLanguage | Word), data=lexdec, REML=F)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.0178724
## (tol = 0.002, component 1)

Convergence problems…

This model fails to converge. Looking at the random effects only (or Variance components and Correlations), we can see why: the random slope for NativeLanguage is highly correlated with the random intercept for Word. Take it out.

(Often models will also not converge if there is an effect that accounts for almost no variance– there will sometimes be a warning about Singular Fit in this case. As best practice, you should always take out higher-order terms that account for little variance, starting with any interactions in random slopes.)

VarCorr(mem1)
##  Groups   Name            Std.Dev. Corr  
##  Word     (Intercept)     0.056464       
##           NativeLanguage1 0.034831 0.930 
##  Subject  (Intercept)     0.183642       
##           Frequency       0.014381 -0.864
##  Residual                 0.169873

Refit

This model fits better!

mem2 <-  lmer ( RT ~ 1 + Frequency*NativeLanguage  + (1 + Frequency | Subject) +  (1  | Word), data=lexdec, REML=F)

summary(mem2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: 
## RT ~ 1 + Frequency * NativeLanguage + (1 + Frequency | Subject) +  
##     (1 | Word)
##    Data: lexdec
## 
##      AIC      BIC   logLik deviance df.resid 
##   -961.2   -912.5    489.6   -979.2     1650 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4091 -0.6154 -0.1101  0.4690  6.1777 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  Word     (Intercept) 0.002916 0.05400       
##  Subject  (Intercept) 0.033702 0.18358       
##           Frequency   0.000204 0.01428  -0.87
##  Residual             0.029173 0.17080       
## Number of obs: 1659, groups:  Word, 79; Subject, 21
## 
## Fixed effects:
##                            Estimate Std. Error t value
## (Intercept)                6.609232   0.049552 133.378
## Frequency                 -0.044834   0.006610  -6.783
## NativeLanguage1            0.286343   0.087308   3.280
## Frequency:NativeLanguage1 -0.027472   0.009158  -3.000
## 
## Correlation of Fixed Effects:
##             (Intr) Frqncy NtvLn1
## Frequency   -0.827              
## NativeLngg1  0.126 -0.081       
## Frqncy:NtL1 -0.103  0.099 -0.815

Interpreting

To interpret what the model means, look at the fixed effects– estimates, the error around them, and the t-value (ratio of estimate to error). Frequency, Native Language, and their interaction all have reliably large effects compared to the error of the estimate.

The overall intercept reflects the estimated (log) RT value when predictors are zero– since we did effects coding for Native Language, this would reflect the RT we’d expect for a word with zero log Frequency averaged across both Native Language conditions.

The effects here are main effects: the change in RT with a one-unit change in log Frequency or Native Language (the difference in average RT going from English to Other).

The interaction reflects that the different levels of Native Language have different slopes by Frequency: the Native Language = Other group has a steeper slope, because Other is coded as the positive contrast (+.5) and the interaction estimate is negative. (We also know this by looking at the plot.)

We also see info about the random effects. There is no estimate term for them: the estimate is defined as zero for random effects. But there is a variance term for each– it captures how much the units in each grouping (subjects, words) systematically deviate from the overall estimates. There are also correlations between slopes and intercepts fit to the same group of data (here: Subject).




Is it a good model?

Recall the same plots we made for the fixed-effect only regression: we can make them again here to diagnose model fit. Note that the syntax differs slightly from the lm case: functions like fitted() and residuals() extract what we need directly from lmer objects.

These plots indicate that this model is a lot better. Adding the random effects (and the extra fixed effects) has substantially improved model fit.
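A sketch of how those diagnostic plots can be produced, refitting mem2 as above:

```r
library(lme4)
library(languageR)

contrasts(lexdec$NativeLanguage) <- c(-.5, .5)
mem2 <- lmer(RT ~ 1 + Frequency*NativeLanguage +
             (1 + Frequency | Subject) + (1 | Word),
             data=lexdec, REML=FALSE)

# fitted() and residuals() work directly on lmer objects:
hist(residuals(mem2))                        # roughly normal residuals?
plot(fitted(mem2), residuals(mem2))          # no structure vs. fitted values?
qqnorm(residuals(mem2)); qqline(residuals(mem2))   # quantile-quantile plot
```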



A lack of p-values…

There are no p-values associated with lmer models. This is because there is no closed-form solution to the model– all the random effects and fixed effects are modeled at once, meaning there is no well-defined notion of a ‘degree of freedom’, which is the missing piece to define p-values.

That’s ok, because you don’t need p-values.

Really.

All you really need is to know that the ratio of effect to error is reasonably large. This means you can decide that t > 2 is sufficient evidence for your hypothesis (twice as much estimate as error). You can cite Baayen 2008 (the R book) as precedent.

Confidence intervals for parameters

Here’s another nice option: get a confidence interval on your parameters. If the 95% CI doesn’t cross zero, the parameter is reliably different from zero.

The ‘.sig’ terms are the random effects: slopes, intercepts, correlations, and the residual.

(There is a warning here in that the .sig03 term– the correlation between the random intercept for Subject and the random slope for Frequency by Subject– is reaching a boundary of -1. In this case, it doesn’t matter because I don’t want to draw inferences about the random effects.)
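The interval table below was produced with confint(); a sketch of the call, refitting mem2 as on the earlier slide:

```r
library(lme4)
library(languageR)

contrasts(lexdec$NativeLanguage) <- c(-.5, .5)
mem2 <- lmer(RT ~ 1 + Frequency*NativeLanguage +
             (1 + Frequency | Subject) + (1 | Word),
             data=lexdec, REML=FALSE)

confint(mem2)                  # profile CIs (the default; can be slow)
confint(mem2, method="Wald")   # faster approximation, fixed effects only
```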



## Computing profile confidence intervals ...
## Warning in FUN(X[[i]], ...): non-monotonic profile for .sig03
## Warning in confint.thpr(pp, level = level, zeta = zeta): bad spline fit
## for .sig03: falling back to linear interpolation
##                                  2.5 %       97.5 %
## .sig01                     0.042439244  0.067877578
## .sig02                     0.130819932  0.267070018
## .sig03                    -1.000000000 -0.511665712
## .sig04                     0.005024811  0.024905203
## .sigma                     0.164939990  0.177015916
## (Intercept)                6.509144374  6.709318895
## Frequency                 -0.058025196 -0.031642958
## NativeLanguage1            0.107083623  0.465601438
## Frequency:NativeLanguage1 -0.046281746 -0.008662151

P-values by model comparison

But, but, but… I want p-values! Ok, if you do decide you want them, here’s a way of thinking about it that’s better defined for modeling: how much does including the factor improve my model? We can test this using model comparison. This workflow does a model comparison that is equivalent to using type-III sums of squares in an ANOVA: drop out one factor at a time and test what it does.

This will only work for predictors that are coded as numerics– here, a numeric version of Native Language, coded with -.5, .5. So, re-run everything using that recoding.

Because of convergence troubles with some of these models, I’ve also had to simplify the random effect structure.
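A sketch of that workflow; the model names match the output below, and NL2 is the numeric recoding of NativeLanguage:

```r
library(lme4)
library(languageR)

# Numeric (-.5, .5) recoding of NativeLanguage:
lexdec$NL2 <- ifelse(lexdec$NativeLanguage == "English", -.5, .5)

# Full model, with the simplified random effect structure:
mem02 <- lmer(RT ~ 1 + Frequency*NL2 + (1 | Subject) + (1 | Word),
              data=lexdec, REML=FALSE)

# Drop one term at a time:
mem20 <- update(mem02, . ~ . - 1)              # drop the intercept
mem21 <- update(mem02, . ~ . - Frequency)      # drop the Frequency main effect
mem22 <- update(mem02, . ~ . - NL2)            # drop the NL2 main effect
mem23 <- update(mem02, . ~ . - Frequency:NL2)  # drop the interaction

anova(mem21, mem02)   # likelihood-ratio (chi-squared) test for Frequency
```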

P-values by model comparison

Next, we compare these models with a series of chi-squared tests, which you call using the command “anova”. Note the last column: this is the p-value associated with the model comparison that leaves the term out. It comes from a chi-squared test on the likelihood ratio of the two models, with one degree of freedom (one parameter is different). The p-value is very small, meaning that adding frequency reliably improves model fit more than would be expected due to chance.

## Data: lexdec
## Models:
## mem20: RT ~ 0 + Frequency * NL2 + (1 | Subject) + (1 | Word)
## mem02: RT ~ 1 + Frequency * NL2 + (1 | Subject) + (1 | Word)
##       Df     AIC     BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## mem20  6 -442.78 -410.30 227.39  -454.78                             
## mem02  7 -954.47 -916.57 484.24  -968.47 513.69      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: lexdec
## Models:
## mem21: RT ~ 1 + NL2 + Frequency:NL2 + (1 | Subject) + (1 | Word)
## mem02: RT ~ 1 + Frequency * NL2 + (1 | Subject) + (1 | Word)
##       Df     AIC     BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## mem21  6 -912.07 -879.59 462.03  -924.07                             
## mem02  7 -954.47 -916.57 484.24  -968.47 44.403      1  2.673e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: lexdec
## Models:
## mem22: RT ~ 1 + Frequency + Frequency:NL2 + (1 | Subject) + (1 | Word)
## mem02: RT ~ 1 + Frequency * NL2 + (1 | Subject) + (1 | Word)
##       Df     AIC     BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## mem22  6 -941.75 -909.26 476.87  -953.75                             
## mem02  7 -954.47 -916.57 484.24  -968.47 14.726      1  0.0001243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: lexdec
## Models:
## mem23: RT ~ 1 + Frequency + NL2 + (1 | Subject) + (1 | Word)
## mem02: RT ~ 1 + Frequency * NL2 + (1 | Subject) + (1 | Word)
##       Df     AIC     BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## mem23  6 -939.69 -907.20 475.84  -951.69                             
## mem02  7 -954.47 -916.57 484.24  -968.47 16.784      1  4.189e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-values by model comparison

Note: there are other ways of getting p-values.

You can use a Satterthwaite approximation to get degrees of freedom for your t distribution, and get p-values that way (https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf).

Or you can do a type-II model comparison (remove all higher-order terms that the effect participates in).

(Or you can just…not do it, seriously, it’s ok.)
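For the lmerTest route, a sketch: loading lmerTest makes lmer() return a model whose summary includes Satterthwaite degrees of freedom and p-values.

```r
library(lmerTest)   # masks lme4::lmer with a p-value-aware version
library(languageR)

contrasts(lexdec$NativeLanguage) <- c(-.5, .5)
memT <- lmer(RT ~ 1 + Frequency*NativeLanguage +
             (1 + Frequency | Subject) + (1 | Word),
             data=lexdec, REML=FALSE)
summary(memT)   # fixed-effect table now has df and Pr(>|t|) columns
```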

GLMER models: repeated measures for binomial outcomes

Mixed models also work for GLMs– next, let’s assess error rates (correct, incorrect) instead of RTs! Note that we don’t have to re-set contrasts for NativeLanguage because we did that above.

This is very straightforward, using a similar syntax as above, but with a ‘g’ and family=‘binomial’ call (as with glm).

I started out with a somewhat complex random effect structure– the largest justified by the design of the study– and simplified due to non-convergence.

## this model doesn't converge -- remove NativeLanguage by Word slope because it (1) accounts for very little variance, and (2) is perfectly correlated with the Word intercept.
##glmer1 <- glmer(Correct ~ Frequency*NativeLanguage + (1 + Frequency | Subject) + (1 + NativeLanguage | Word), data=lexdec, family='binomial')

## this model converges but is still overfitted-- the Frequency by Subject slope is still highly correlated with Subject intercept, so remove.
##glmer1 <- glmer(Correct ~ Frequency*NativeLanguage + (1 + Frequency | Subject) + (1 | Word), data=lexdec, family='binomial')

glmer1 <- glmer(Correct ~ Frequency*NativeLanguage + (1 | Subject) + (1 | Word), data=lexdec, family='binomial')

GLMER models: repeated measures for binomial outcomes

After removing some random slopes (common for glmer models), we have a fitted model.

We can see that because the intercept is negative, the 0 value (by default, the first level alphabetically– here, correct) is more likely.

Frequency has a significant effect on the likelihood of incorrect responses– higher frequency is associated with more correct responses, as indicated by the negative estimate.

Because this is a glmer, we also use a (Wald) z-value to compare estimate to error, and then we get a p-value associated with our fixed effect terms. This type of p-value is ‘anti-conservative’ (too liberal) but an ok approximation that we get for free.
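Because the estimates are in log odds, plogis() converts them back to probabilities. A worked sketch using the estimates from the output below (note that Frequency = 0 is an extrapolation below the observed range):

```r
# Probability of an incorrect response at Frequency = 0,
# averaged over NativeLanguage (intercept only):
plogis(-1.5123)             # about 0.18

# One unit more log Frequency shifts the log odds by -0.55:
plogis(-1.5123 + -0.5524)   # about 0.11: fewer errors
```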





## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Correct ~ Frequency * NativeLanguage + (1 | Subject) + (1 | Word)
##    Data: lexdec
## 
##      AIC      BIC   logLik deviance df.resid 
##    497.1    529.6   -242.6    485.1     1653 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.0664 -0.1801 -0.1237 -0.0882  8.3313 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Word    (Intercept) 1.0404   1.0200  
##  Subject (Intercept) 0.6578   0.8111  
## Number of obs: 1659, groups:  Word, 79; Subject, 21
## 
## Fixed effects:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.5123     0.7364  -2.054 0.039998 *  
## Frequency                  -0.5524     0.1555  -3.553 0.000381 ***
## NativeLanguage1             1.8231     0.9919   1.838 0.066073 .  
## Frequency:NativeLanguage1  -0.3119     0.2175  -1.434 0.151654    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Frqncy NtvLn1
## Frequency   -0.889              
## NativeLngg1 -0.036  0.015       
## Frqncy:NtL1  0.027  0.011 -0.882

GLMER models: repeated measures for binomial outcomes

To assess model fit, we can again plot the distribution of residuals and the qqplot. These will tend to be less pretty than in a linear model, which relates to the shape of the log odds distribution we are comparing to (steep slope in the middle, shallow on the sides), and to the different meaning that ‘residual’ takes here– these reflect not some error term in the model, but the model predicting “correct” when it should have predicted “incorrect”.

But they are still informative: these show that we have some points with very positive values that we aren’t accounting for well. These are unexplained by our model at present!








GLMER models: repeated measures for binomial outcomes

One last set of plots to make… what does the effect look like from our model? For glmer models (and for complicated lmer models), I like to extract the effects using the ‘effects’ package.

## NOTE: Frequency is not a high-order term in the model
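A sketch of that workflow, refitting glmer1 from the earlier slide. (The NOTE above is what effects prints when you request a term, like Frequency, that also appears in a higher-order interaction.)

```r
library(effects)
library(lme4)
library(languageR)

contrasts(lexdec$NativeLanguage) <- c(-.5, .5)
glmer1 <- glmer(Correct ~ Frequency*NativeLanguage +
                (1 | Subject) + (1 | Word),
                data=lexdec, family='binomial')

eff <- effect("Frequency", glmer1)   # triggers the NOTE, but still works
plot(eff)

plot(allEffects(glmer1))   # every high-order term at once
```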

More about interactions in mixed models

More complex interaction terms are something that is often new to people when moving away from ANOVA.

This is pretty straightforward in (g)lm and (g)lmer models– in fact, you don’t have to do post-hoc comparisons.

More about interactions in mixed models

First tip: Interpreting multi-way interactions is really just like interpreting two-way interactions, scaled up.

-> As a heuristic, if there is an interaction effect, there is often just one point that is different from what you expect. Plot your numbers, and find it. This will help guide you in reading your output.


Second tip: But, interpretations of main effects depend a lot on contrasts.


Third tip: There will be MORE CONTRASTS for variables with more levels– but you can use this to your advantage.

Apply this in an example.

I am basing this off of the cake data in lme4. This was a food science experiment where some scientists made snack cakes (like Twinkies) and broke them in half to see if the new recipe had the properties of the original: here, breakage angle, which is a proxy for denseness/moistness of cake.

I took the numbers from the original data and re-combined, and resampled, to make a useful teaching example. Details below…

Cake analysis setup

I am going to set up effects contrasts for temperature and time (compare time 1 to time 2; compare temp 1 to temp 2; make the effects of each other variable reflect main effects).

For recipe, I’ll use a coding scheme called Helmert, where we compare recipes A and B to recipe C in one contrast, and compare A and B in the other.

Because zero is in the middle of each of these coding schemes, the main effects in this model can be interpreted as true main effects (averages over the other factors).
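A sketch of the contrast assignments that produce the matrices below (cake2 is the resampled data described above, so the data object itself is assumed):

```r
# Effects coding for the two-level factors:
contrasts(cake2$time)        <- c(-1, 1)   # t1 vs. t2
contrasts(cake2$temperature) <- c(-1, 1)   # 175 vs. 225

# Helmert-style coding for the three-level recipe factor:
contrasts(cake2$recipe) <- cbind(c(-1,  1, 0),   # A vs. B
                                 c(-1, -1, 2))   # (A, B) vs. C

# Inspect, as in the output below:
contrasts(cake2$time); contrasts(cake2$temperature); contrasts(cake2$recipe)
```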

##    [,1]
## t1   -1
## t2    1
##     [,1]
## 175   -1
## 225    1
##   [,1] [,2]
## A   -1   -1
## B    1   -1
## C    0    2

Cake analysis

Fit a model. Note the three-way interaction, involving a three-level variable. (eek! really, don’t try to run anything more complex than this)

m1 <- lmer(angle ~ recipe*temperature*time + (1|batch), data=cake2)

Unpacking the model

Take t > |2| to reflect reliable effects. We observe…

## Linear mixed model fit by REML ['lmerMod']
## Formula: angle ~ recipe * temperature * time + (1 | batch)
##    Data: cake2
## 
## REML criterion at convergence: 575
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.43768 -0.47885 -0.05524  0.41175  2.31054 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  batch    (Intercept) 34.50    5.874   
##  Residual             22.93    4.788   
## Number of obs: 90, groups:  batch, 45
## 
## Fixed effects:
##                            Estimate Std. Error t value
## (Intercept)                 33.5759     1.0130  33.146
## recipe1                     -0.4821     1.2406  -0.389
## recipe2                      1.7723     0.7163   2.474
## temperature1                 5.3824     0.5059  10.640
## time1                       -1.9717     1.0130  -1.947
## recipe1:temperature1         0.6473     0.6195   1.045
## recipe2:temperature1         1.7708     0.3577   4.951
## recipe1:time1               -0.5179     1.2406  -0.417
## recipe2:time1                1.3318     0.7163   1.859
## temperature1:time1           1.2634     0.5059   2.498
## recipe1:temperature1:time1  -0.4598     0.6195  -0.742
## recipe2:temperature1:time1   1.6875     0.3577   4.718
## 
## Correlation of Fixed Effects:
##             (Intr) recip1 recip2 tmprt1 time1  rcp1:tmp1 rcp2:tmp1
## recipe1      0.000                                                
## recipe2      0.000  0.000                                         
## temperatur1  0.000  0.000  0.000                                  
## time1       -0.067  0.000  0.000  0.000                           
## rcp1:tmprt1  0.000  0.000  0.000  0.000  0.000                    
## rcp2:tmprt1  0.000  0.000  0.000  0.000  0.000  0.000             
## recipe1:tm1  0.000 -0.067  0.000  0.000  0.000  0.000     0.000   
## recipe2:tm1  0.000  0.000 -0.067  0.000  0.000  0.000     0.000   
## tmprtr1:tm1  0.000  0.000  0.000 -0.067  0.000  0.000     0.000   
## rcp1:tmp1:1  0.000  0.000  0.000  0.000  0.000 -0.067     0.000   
## rcp2:tmp1:1  0.000  0.000  0.000  0.000  0.000  0.000    -0.067   
##             recp1:tm1 recp2:tm1 tmp1:1 r1:1:1
## recipe1                                      
## recipe2                                      
## temperatur1                                  
## time1                                        
## rcp1:tmprt1                                  
## rcp2:tmprt1                                  
## recipe1:tm1                                  
## recipe2:tm1  0.000                           
## tmprtr1:tm1  0.000     0.000                 
## rcp1:tmp1:1  0.000     0.000     0.000       
## rcp2:tmp1:1  0.000     0.000     0.000  0.000

Unpacking the model

There is a three-way interaction between the 2nd recipe contrast, temperature, and time. This means: the temperature-by-time (2-way) interaction differs when comparing recipe C to the combination of the other two levels.

This is to say: when you look at the plot, there’s a two-way interaction visible in the A and B panels. There’s also a two-way interaction visible in the C panel– and it’s a different one. Specifically, the last yellow point goes up, rather than down.





More about mixed models

To learn more about mixed models: http://www.bodowinter.com/tutorial/bw_LME_tutorial2.pdf

For trouble-shooting: https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html

And feel free to use my own best-practices guide, which I have put in the folder for this course.